weakly supervised
- Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.05)
- North America > Canada (0.05)
- Asia > Middle East > Jordan (0.04)
First provide a summary of the paper, and then address the following criteria: quality, clarity, originality, and significance. The paper describes a method to identify image patches that are (a) diagnostic of particular objects, (b) not particularly redundant, and (c) cover the collection of diagnostic patches well. The method applies to the weakly supervised case, where images are known to contain the object(s) of interest but the locations of these objects are not known. This is a very well studied topic. Once these patches have been identified, related pairs are found by a mining process.
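The three criteria the review summarizes (diagnostic, non-redundant, good coverage) admit a simple greedy reading. The sketch below is an illustrative stand-in, not the paper's actual algorithm: `select_patches`, the score/similarity inputs, and the greedy trade-off are all assumptions made here for clarity.

```python
def select_patches(scores, similarity, k, redundancy_weight=0.5):
    """Greedily pick k patch indices that score high (diagnostic) while
    penalizing similarity to already-selected patches (non-redundant).
    Hypothetical sketch; the paper's exact selection scheme may differ."""
    selected = []
    candidates = set(range(len(scores)))
    while candidates and len(selected) < k:
        def gain(i):
            # redundancy = closest match among patches already picked
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return scores[i] - redundancy_weight * redundancy
        best = max(candidates, key=gain)
        selected.append(best)
        candidates.remove(best)
    return selected
```

For example, with two near-duplicate high-scoring patches and one moderate distinct patch, the greedy pass keeps one of the duplicates and the distinct patch rather than both duplicates.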
- Europe > France (0.14)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Europe > Ukraine (0.04)
Comprehensive Attention Self-Distillation for Weakly-Supervised Object Detection Supplementary Material
Bottom: CASD overlaid with attentions. Recall that WSOD conducts classification on object proposals (e.g., bounding boxes generated by Selective Search). Figure 1 shows both the success and the failure cases of CASD; the failure cases could be improved by hard-sample mining in CASD training. The localization advantages of CASD come from its learning of comprehensive attention (see the bottom row of Figure 1). CorLoc only evaluates the localization accuracy of detectors.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.05)
- North America > Canada (0.05)
Boosting Weakly Supervised Referring Image Segmentation via Progressive Comprehension
This paper explores the weakly-supervised referring image segmentation (WRIS) problem, and focuses on a challenging setup where target localization is learned directly from image-text pairs. We note that the input text description typically already contains detailed information on how to localize the target object, and we also observe that humans often follow a step-by-step comprehension process (i.e., progressively utilizing target-related attributes and relations as cues) to identify the target object. Hence, we propose a novel Progressive Comprehension Network (PCNet) to leverage target-related textual cues from the input description for progressively localizing the target object. Specifically, we first use a Large Language Model (LLM) to decompose the input text description into short phrases. These short phrases are taken as target-related cues and fed into a Conditional Referring Module (CRM) in multiple stages, to allow updating the referring text embedding and enhance the response map for target localization in a multi-stage manner. Based on the CRM, we then propose a Region-aware Shrinking (RaS) loss to constrain the visual localization to be conducted progressively in a coarse-to-fine manner across different stages. Finally, we introduce an Instance-aware Disambiguation (IaD) loss to suppress instance localization ambiguity by differentiating overlapping response maps generated by different referring texts on the same image. Extensive experiments show that our method outperforms SOTA methods on three common benchmarks.
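The multi-stage comprehension the abstract describes (LLM decomposition into phrase cues, then stage-wise refinement through the CRM) can be sketched as a control-flow skeleton. All names here (`decompose`, `crm`, `progressive_localization`) are hypothetical stand-ins for the paper's components, not its actual API.

```python
def progressive_localization(text, image_feats, decompose, crm, num_stages=3):
    """Illustrative skeleton of PCNet-style progressive comprehension.
    `decompose` stands in for the LLM phrase decomposition; `crm` stands
    in for the Conditional Referring Module, which updates both the text
    embedding and the response map at each stage."""
    cues = decompose(text)            # referring text -> short phrase cues
    embedding, response = text, None
    for stage in range(min(num_stages, len(cues))):
        # each stage conditions on one target-related cue and refines
        # the response map produced by the previous stage
        embedding, response = crm(embedding, cues[stage], image_feats, response)
    return response
```

The point of the skeleton is the coarse-to-fine loop: later stages see both a richer text embedding and the previous stage's response map, which is what the RaS loss then constrains to shrink toward the target.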
TeD-Loc: Text Distillation for Weakly Supervised Object Localization
Murtaza, Shakeeb, Belharbi, Soufiane, Pedersoli, Marco, Granger, Eric
Weakly supervised object localization (WSOL) using classification models trained with only image-class labels remains an important challenge in computer vision. Given their reliance on classification objectives, traditional WSOL methods like class activation mapping focus on the most discriminative object parts, often missing the full spatial extent. In contrast, recent WSOL methods based on vision-language models like CLIP require ground truth classes or external classifiers to produce a localization map, limiting their deployment in downstream tasks. Moreover, methods like GenPromp attempt to address these issues but introduce considerable complexity due to their reliance on conditional denoising processes and intricate prompt learning. This paper introduces Text Distillation for Localization (TeD-Loc), an approach that directly distills knowledge from CLIP text embeddings into the model backbone and produces patch-level localization. Multiple instance learning of these image patches allows for accurate localization and classification using one model without requiring external classifiers. Such integration of textual and visual modalities addresses the longstanding challenge of achieving accurate localization and classification concurrently, as WSOL methods in the literature typically converge at different epochs. Extensive experiments show that leveraging text embeddings and localization cues provides a cost-effective WSOL model. TeD-Loc improves Top-1 LOC accuracy over state-of-the-art models by about 5% on both CUB and ILSVRC datasets, while significantly reducing computational complexity compared to GenPromp.
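A schematic reading of the TeD-Loc mechanism: patch embeddings aligned with CLIP text embeddings yield a localization map directly from patch-text similarity, and multiple-instance pooling over patches yields an image-level score. The cosine-similarity map and top-k mean pooling below are assumptions chosen for illustration; the paper's distillation loss and pooling operator are not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    du = math.sqrt(sum(x * x for x in u))
    dv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (du * dv)

def patch_scores(patch_embs, text_emb):
    """Patch-level localization map: similarity of each patch embedding
    to the class text embedding (hypothetical stand-in for distilled CLIP)."""
    return [cosine(p, text_emb) for p in patch_embs]

def mil_image_score(scores, k=2):
    """Multiple-instance pooling: mean of the top-k patch scores gives
    one image-level classification score, so a single model both
    localizes (per-patch map) and classifies (pooled score)."""
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)
```

This is why no external classifier is needed in such a design: the same patch-text similarities serve as both the localization map and, after pooling, the classification evidence.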
- North America > United States > California (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Research Report > Promising Solution (0.66)
- Research Report > New Finding (0.46)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Spatial Action Unit Cues for Interpretable Deep Facial Expression Recognition
Belharbi, Soufiane, Pedersoli, Marco, Koerich, Alessandro Lameiras, Bacon, Simon, Granger, Eric
Although state-of-the-art classifiers for facial expression recognition (FER) can achieve a high level of accuracy, they lack interpretability, an important feature for end-users. Experts typically associate spatial action units (AUs) from a codebook with facial regions for the visual interpretation of expressions. In this paper, the same expert steps are followed. A new learning strategy is proposed to explicitly incorporate AU cues into classifier training, allowing deep interpretable models to be trained. During training, this AU codebook is used, along with the input image's expression label and facial landmarks, to construct an AU heatmap that indicates the most discriminative image regions of interest w.r.t. the facial expression. This valuable spatial cue is leveraged to train a deep interpretable classifier for FER. This is achieved by constraining the spatial layer features of a classifier to be correlated with AU heatmaps. Using a composite loss, the classifier is trained to correctly classify an image while yielding interpretable visual layer-wise attention correlated with AU maps, simulating the expert decision process. Our strategy relies only on image expression labels for supervision, without additional manual annotations. It is generic, and can be applied to any deep CNN- or transformer-based classifier without requiring any architectural change or significant additional training time. Our extensive evaluation on two public benchmarks, RAF-DB and AffectNet, shows that our proposed strategy can improve layer-wise interpretability without degrading classification performance. In addition, we explore a common type of interpretable classifier that relies on class activation mapping (CAM) methods, and show that our approach can also improve CAM interpretability.
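The "constraining layer features to be correlated with AU heatmaps" step can be sketched as a correlation-based alignment term added to the classification loss. This is a minimal sketch under the assumption that the alignment term is one-minus-Pearson-correlation over flattened maps; the paper's composite loss may weight or formulate this differently.

```python
import math

def pearson(a, b):
    """Pearson correlation between two flattened spatial maps."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

def au_alignment_loss(attention_map, au_heatmap):
    """Illustrative alignment term: 0 when the layer's attention is
    perfectly correlated with the AU heatmap, larger otherwise.
    A composite loss would add this to the usual classification loss."""
    return 1.0 - pearson(attention_map, au_heatmap)
```

An attention map that is a positive linear rescaling of the AU heatmap incurs zero alignment penalty, which matches the stated goal: the classifier is free in magnitude but pushed to attend where the AU codebook says an expert would look.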
Weakly Supervised Pretraining and Multi-Annotator Supervised Finetuning for Facial Wrinkle Detection
Moon, Ik Jun, Moon, Junho, Jang, Ikbeom
Analyzing extensive collections of images can be exceedingly resource-intensive if each facial wrinkle must be individually assessed. Moreover, the subjectivity inherent in manual segmentation can diminish the reliability of research findings and poses a substantial problem. To address this issue, we effectively combine wrinkle data labeled by multiple annotators to minimize inter-rater variability, and use these image-label pairs to train our model.
- Asia > South Korea > Seoul > Seoul (0.06)
- Asia > South Korea > Ulsan > Ulsan (0.05)